Deep learning for predicting 16S rRNA gene copy number

Culture-independent 16S rRNA gene metabarcoding is a commonly used method for microbiome profiling. To achieve more quantitative cell fraction estimates, it is important to account for the 16S rRNA gene copy number (hereafter 16S GCN) of different community members. Currently, several bioinformatic tools are available to estimate 16S GCN values, based either on taxonomy assignment or on phylogeny. Here we present ANNA16 (Artificial Neural Network Approximator for 16S rRNA gene copy number), a deep learning-based method that estimates 16S GCN values directly from 16S rRNA gene sequence strings. Based on 27,579 16S rRNA gene sequences and gene copy number data from the rrnDB database, we show that ANNA16 outperforms commonly used 16S GCN prediction algorithms. Interestingly, Shapley Additive exPlanations (SHAP) analysis shows that ANNA16 can identify unexpected informative positions in 16S rRNA gene sequences without any prior phylogenetic knowledge, suggesting potential applications beyond 16S GCN prediction.


1.2. Other Machine Learning Models Tested
In addition to the SEM model, this study also tested a Transformer-based model, a Residual Multi-layer Perceptron (ResMLP), a ResNet34 [1], and a k-nearest neighbor model. The framework and architecture of these models are shown in Figure S1.

Transformer
For the transformer-based model in Figure S1, we leveraged the original transformer encoder architecture with cosine learning rate decay [2], followed by global average pooling and two dense layers. To help the transformer understand the genetic sequences, the K-mers were ranked by their frequency in the training data and the ranks were assigned as the token indices.
In this way, a sequence is converted into a list of tokens. As shown in Figure S1c, the transformer encoder takes the token embedding and the positional encoding as input. For biological sequences, a dense embedding is considered more effective than sparse tokenization. Each sequence is thus represented as a matrix $M \in \mathbb{R}^{L \times d}$, where $L$ is the number of tokens, by embedding each 6-mer into a numerical vector of dimension $d$ via the embedding layer. The cosine positional encoding method is used to incorporate both absolute and relative positional information by constructing a matrix $P$ of the same shape as $M$ and adding it to $M$ [3], where each position $pos$ and embedding dimension $i$ have:

$P_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d}\right), \qquad P_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d}\right)$

The encoder includes multi-head self-attention, a position-wise feed-forward network, residual connections, and layer normalization [3]. With the combination of these modules, Transformer-based models have achieved remarkable success, bringing a surprise to the data science community [5].
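To make the tokenization and positional-encoding steps above concrete, the following Python sketch (not the authors' code; the 6-mer size, embedding dimension, and function names are illustrative assumptions) ranks K-mers by their frequency in the training data and builds the sinusoidal matrix P:

import numpy as np
from collections import Counter

def kmer_tokenize(sequences, k=6):
    """Rank K-mers by frequency in the training data and use the rank as the token index."""
    counts = Counter(kmer for seq in sequences
                     for kmer in (seq[i:i + k] for i in range(len(seq) - k + 1)))
    # The most frequent K-mer gets index 1; 0 is reserved for padding/unknown K-mers.
    vocab = {kmer: rank + 1 for rank, (kmer, _) in enumerate(counts.most_common())}
    tokens = [[vocab.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]
              for seq in sequences]
    return tokens, vocab

def positional_encoding(seq_len, d):
    """Sine/cosine positional-encoding matrix P of shape (seq_len, d)."""
    pos = np.arange(seq_len)[:, None]      # token positions 0..L-1
    i = np.arange(d)[None, :]              # embedding dimensions 0..d-1
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

tokens, vocab = kmer_tokenize(["ACGTACGTAGCT", "ACGTTGCATGCA"], k=6)
P = positional_encoding(seq_len=len(tokens[0]), d=16)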

ResMLP
The ResMLP model in this study is trained with the Root Mean Squared Error (RMSE) loss:

$\mathrm{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\left(y_{\mathrm{true},i} - y_{\mathrm{pred},i}\right)^{2}}$

where $n$ is the size of the input dataset, $y_{\mathrm{true}}$ is the true copy number, and $y_{\mathrm{pred}}$ is the predicted copy number.
Hyper-parameter tuning is performed on the number of dense blocks, the number of dense layers in a dense block, the dense layer size, and the learning rate. The hidden layer size is selected from (128, 256, 512, 1024). The learning rate is tested from 0.001 to 0.002 with a step of 0.0002. The final architecture of the ResMLP for full-length sequences is shown in Figures S1d & S1e.
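As an illustration of a residual dense block trained with the RMSE loss described above, a minimal Keras sketch follows; the layer sizes, block count, optimizer, and feature dimension are assumptions for illustration, not the published ResMLP architecture:

import tensorflow as tf

def dense_block(x, units, n_layers=2):
    """A stack of dense layers with a residual (skip) connection around it."""
    shortcut = x
    for _ in range(n_layers):
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    if shortcut.shape[-1] != units:
        # Project the shortcut so the two branches have matching widths.
        shortcut = tf.keras.layers.Dense(units)(shortcut)
    return tf.keras.layers.Add()([x, shortcut])

def rmse(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))

n_features = 4096                              # e.g. number of distinct 6-mers kept as features (assumed)
inputs = tf.keras.Input(shape=(n_features,))
x = dense_block(inputs, 512)
x = dense_block(x, 256)
outputs = tf.keras.layers.Dense(1)(x)          # predicted 16S GCN
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=rmse)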

k-Nearest Neighbor
Because each 16S sequence is represented as a vector of K-mer counts, a k-nearest neighbor (kNN) model was added as a simple benchmark. This algorithm predicts the copy number of a query sequence from the copy numbers of its closest training sequences. We implemented this algorithm with the KNeighborsRegressor estimator from the sklearn package in Python. The contributions of the neighbors were weighted by Euclidean distance. The value of k was tuned from 1 to 7, and a final value of 5 was selected.
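The kNN benchmark can be set up in a few lines with scikit-learn; the toy K-mer count matrix and copy-number labels below are illustrative, and weights="distance" is scikit-learn's built-in inverse-distance weighting:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.random.randint(0, 5, size=(100, 4096)).astype(float)   # toy K-mer count vectors
y_train = np.random.randint(1, 16, size=100).astype(float)          # toy 16S copy numbers

knn = KNeighborsRegressor(
    n_neighbors=5,        # final value selected after tuning k from 1 to 7
    weights="distance",   # neighbors weighted by (inverse) distance
    metric="euclidean",
)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_train[:3])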

ResNet34
ResNet34 is a 34-layer convolutional neural network with residual (skip) connections.
This architecture was originally designed for image recognition [1] but has also been found useful for protein annotation [6]. One-hot encoding is applied to the 16S sequences so that they can be treated as an "image" and fed into ResNet34. The details of the architecture implemented in this study are shown in Figure S1.
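A minimal sketch of the one-hot encoding step is given below (the channel order A/C/G/T, padding length, and function name are illustrative assumptions): each 16S sequence becomes an (L, 4, 1) array that a 2-D convolutional network such as ResNet34 can consume as an "image".

import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq, max_len=1600):
    """One-hot encode a DNA sequence; ambiguous bases (e.g. N) remain all-zero."""
    x = np.zeros((max_len, 4, 1), dtype=np.float32)
    for i, base in enumerate(seq[:max_len]):
        j = BASES.get(base.upper())
        if j is not None:
            x[i, j, 0] = 1.0
    return x

image = one_hot("ACGTNACGT")
print(image.shape)   # (1600, 4, 1)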

Model Performance
The performance of the machine learning models is shown in Figure S2. The ResNet34 model exhibits a mean RMSE of 0.965 copies/genome (SD = 0.132), the Transformer 0.861 copies/genome (SD = 0.0643), the kNN 0.764 copies/genome (SD = 0.0440), and the ResMLP 0.726 copies/genome (SD = 0.0381).
Notably, the performance of the Transformer is significantly inferior to that of the SEM (p < 0.001). This outcome is unexpected because Transformer-based models, such as BERT [7] or GPT [8], have achieved substantial success in the Natural Language Processing (NLP) field. A possible explanation is the computing cost of the self-attention mechanism, which increases quadratically with sequence length, rendering Transformer-based models typically unsuitable for modeling long sequences [9]. For context, classical NLP models normally accept inputs of at most 512 or 1,024 tokens [7]. Since biological sequences are more complex than natural language and the input length here exceeds one thousand tokens, the subpar performance of the Transformer-based model is plausible. For future studies intending to use Transformer-based models to study 16S rRNA gene sequences, employing non-overlapping K-mer encoding, as was utilized in [10], might be beneficial in reducing the sequence length to a size manageable by the Transformer.
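As a sketch of the non-overlapping K-mer encoding suggested above (the function name and toy sequence are illustrative), splitting a ~1,500 nt 16S sequence into non-overlapping 6-mers reduces the token count to roughly 250, well within typical Transformer input limits:

def nonoverlapping_kmers(seq, k=6):
    """Split a sequence into consecutive, non-overlapping K-mers."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

seq = "ACGT" * 375                 # toy 1,500 nt sequence
tokens = nonoverlapping_kmers(seq)
print(len(seq), len(tokens))       # 1500 250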

Figure S1. Illustration of the Framework and Architecture of the Deep Learning Models. The ranges tested during hyper-parameter tuning are included in the parentheses. … d) Architecture of the ResMLP model. e) Architecture of a Dense Block inside the ResMLP model. f) Architecture of the ResNet34 model. g) Architecture of the Identity Block. h) Architecture of the …

Figure S5. Contribution of each feature extractor in ANNA16.

Figure S6. The relationship between the true copy number and the prediction residuals.

Figure S7. Distribution of the 16S copy number in the whole dataset.

Table S2. Parameters of subunits in SEM.

Table S3. Information on the mock communities.

Table S4. Uncorrected, true, and estimated abundances of strains in the even community.

Table S5. Uncorrected, true, and estimated abundances of strains in the staggered community.